Palmer Penguins: Plotly Visualization Tutorial

PIC16B: Homework 0

homework
Author: Kenny Guo

Published: January 22, 2025

Who doesn’t love penguins? In this plotly visualization tutorial, we’ll be examining the “palmer_penguins” data set, graciously collected and published by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER (you can read more about the project and dataset here).

It contains data on 344 penguins from the Anvers region, spanning three species (Adelie, Chinstrap, and Gentoo), along with characteristics such as their home island, culmen (bill) length and depth, flipper length, body mass, sex, and the nitrogen and carbon isotope ratios (Delta 15 N and Delta 13 C) measured from their blood.

Let’s first import the libraries we need. In this tutorial, we’ll construct a simple visualization using plotly express. We’ll also import plotly.io and set its default renderer to "iframe" so our figure can be displayed on this webpage, along with pandas and numpy to help with our initial data wrangling. Then, we’ll read the dataset into a dataframe called penguins.

import pandas as pd
import numpy as np
from plotly import express as px
import plotly.io as pio
pio.renderers.default = "iframe"

url = "https://raw.githubusercontent.com/pic16b-ucla/24W/main/datasets/palmer_penguins.csv"
penguins = pd.read_csv(url)
penguins
| | studyName | Sample Number | Species | Region | Island | Stage | Individual ID | Clutch Completion | Date Egg | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex | Delta 15 N (o/oo) | Delta 13 C (o/oo) | Comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PAL0708 | 1 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A1 | Yes | 11/11/07 | 39.1 | 18.7 | 181.0 | 3750.0 | MALE | NaN | NaN | Not enough blood for isotopes. |
| 1 | PAL0708 | 2 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A2 | Yes | 11/11/07 | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE | 8.94956 | -24.69454 | NaN |
| 2 | PAL0708 | 3 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A1 | Yes | 11/16/07 | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE | 8.36821 | -25.33302 | NaN |
| 3 | PAL0708 | 4 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A2 | Yes | 11/16/07 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Adult not sampled. |
| 4 | PAL0708 | 5 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N3A1 | Yes | 11/16/07 | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE | 8.76651 | -25.32426 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 339 | PAL0910 | 120 | Gentoo penguin (Pygoscelis papua) | Anvers | Biscoe | Adult, 1 Egg Stage | N38A2 | No | 12/1/09 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 340 | PAL0910 | 121 | Gentoo penguin (Pygoscelis papua) | Anvers | Biscoe | Adult, 1 Egg Stage | N39A1 | Yes | 11/22/09 | 46.8 | 14.3 | 215.0 | 4850.0 | FEMALE | 8.41151 | -26.13832 | NaN |
| 341 | PAL0910 | 122 | Gentoo penguin (Pygoscelis papua) | Anvers | Biscoe | Adult, 1 Egg Stage | N39A2 | Yes | 11/22/09 | 50.4 | 15.7 | 222.0 | 5750.0 | MALE | 8.30166 | -26.04117 | NaN |
| 342 | PAL0910 | 123 | Gentoo penguin (Pygoscelis papua) | Anvers | Biscoe | Adult, 1 Egg Stage | N43A1 | Yes | 11/22/09 | 45.2 | 14.8 | 212.0 | 5200.0 | FEMALE | 8.24246 | -26.11969 | NaN |
| 343 | PAL0910 | 124 | Gentoo penguin (Pygoscelis papua) | Anvers | Biscoe | Adult, 1 Egg Stage | N43A2 | Yes | 11/22/09 | 49.9 | 16.1 | 213.0 | 5400.0 | MALE | 8.36390 | -26.15531 | NaN |

344 rows × 17 columns

Data Wrangling

For this simple visualization, our goal is to distinguish the three species of penguins: Adelie, Gentoo, and Chinstrap. We’ll use a 2D graph, so we only need two features: the 'Flipper Length (mm)' and 'Culmen Length (mm)' columns.

Note that some entries have NaN (missing) values for flipper length and culmen length. We could handle these in a variety of ways, but for this example, we’ll simply remove them. As seen below, only 2 entries have missing values compared to 342 without, so dropping them discards very little data.
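Before dropping anything, it can help to count how many rows are actually affected. Here is a minimal sketch of that check, using only the first five rows shown in the raw output above rather than the full dataset; the names sample, missing_per_column, rows_with_missing, and rows_complete are just illustrative.

```python
import numpy as np
import pandas as pd

# Rows 0-4 copied from the raw output above; row 3 has no measurements.
sample = pd.DataFrame({
    "Species": ["Adelie Penguin (Pygoscelis adeliae)"] * 5,
    "Flipper Length (mm)": [181.0, 186.0, 195.0, np.nan, 193.0],
    "Culmen Length (mm)": [39.1, 39.5, 40.3, np.nan, 36.7],
})

# Number of missing values in each column
missing_per_column = sample.isna().sum()

# Number of rows with at least one missing value, and rows left after dropping them
rows_with_missing = int(sample.isna().any(axis=1).sum())
rows_complete = len(sample.dropna())

print(missing_per_column)
print(rows_with_missing, rows_complete)
```

Running the same three lines on the full penguins dataframe (after selecting the three columns) should give the 2 versus 342 split mentioned above.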

Finally, for ease of reading, we’ll keep only the first word of each entry in the 'Species' column, dropping the word “penguin” and the scientific name.

# getting just these 3 columns
penguins = penguins[['Species', 'Flipper Length (mm)', 'Culmen Length (mm)']]
# dropping NaN values
penguins = penguins.dropna()
# getting just the species name
penguins["Species"] = penguins["Species"].str.split().str.get(0)

penguins
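| | Species | Flipper Length (mm) | Culmen Length (mm) |
|---|---|---|---|
| 0 | Adelie | 181.0 | 39.1 |
| 1 | Adelie | 186.0 | 39.5 |
| 2 | Adelie | 195.0 | 40.3 |
| 4 | Adelie | 193.0 | 36.7 |
| 5 | Adelie | 190.0 | 39.3 |
| ... | ... | ... | ... |
| 338 | Gentoo | 214.0 | 47.2 |
| 340 | Gentoo | 215.0 | 46.8 |
| 341 | Gentoo | 222.0 | 50.4 |
| 342 | Gentoo | 212.0 | 45.2 |
| 343 | Gentoo | 213.0 | 49.9 |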

342 rows × 3 columns
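If the chained .str.split().str.get(0) call in the wrangling code looks opaque, here is a small sketch that breaks it into two steps on the two species strings visible in the raw output. The variable names are only for illustration.

```python
import pandas as pd

# Two species labels exactly as they appear in the raw dataset output
species = pd.Series([
    "Adelie Penguin (Pygoscelis adeliae)",
    "Gentoo penguin (Pygoscelis papua)",
])

# Step 1: split each string on whitespace, giving a list of words per entry
tokens = species.str.split()

# Step 2: take the first word of each list, which is the common species name
first_word = tokens.str.get(0)

print(first_word.tolist())  # ['Adelie', 'Gentoo']
```

Taking the first word works here because every label starts with a one-word common name; a dataset with multi-word names would need a different rule.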

Visualization with Plotly

Excellent! Now to visualize, we’ll use plotly’s scatter plot. To do this, we’ll create a figure using px.scatter. It takes in a variety of parameters, including:

  • data_frame: our penguins dataframe
  • x and y: the columns to plot on the x and y axes
  • color: the column used to color the points, here the species
  • width and height: the size of the plot in pixels

We’ll also add marginal histograms through the marginal_x and marginal_y parameters. Each one displays the distribution of a single variable: on the top we’ll see the distribution of flipper lengths across species, and on the right we’ll see the distribution of culmen lengths across species.

Finally, we’ll use the fig.update_layout method to adjust some of our plot aesthetics by adding a title, adjusting the margins, and setting the plot template.

fig = px.scatter(data_frame=penguins,
                 x="Flipper Length (mm)",
                 y="Culmen Length (mm)",
                 color="Species",
                 width=800,
                 height=500,
                 marginal_x="histogram",
                 marginal_y="histogram")

# Adjust the margins, add in a title, and set a plot template
fig.update_layout(margin={"r":0,"t":40,"l":0,"b":0}, 
                  title_text="Culmen Length vs. Flipper Length of the Three Anvers Penguin Species",
                  template="ggplot2")

# Show the plot
fig.show()

Discussion

From this plot, we can see some rough distinctions between the three species based on these two features alone. Thanks to plotly’s interactive features, you can hover over any individual point and see the penguin’s species and its individual measurements.

From the marginal histograms, we can see even more clearly that flipper length separates Gentoo penguins from the other two species fairly well, as Gentoo flippers are noticeably longer on average, while culmen length separates Adelie penguins fairly well, as their culmens are noticeably shorter. Hover over any bin and plotly will display the count of penguins within that range.
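One way to put rough numbers on these visual impressions is a per-species average. The sketch below is only a partial check: it uses the ten rows visible in the cleaned output above, so it covers Adelie and Gentoo but not Chinstrap, and the averages are for that small sample rather than the full dataset. The names sample and means are illustrative.

```python
import pandas as pd

# Ten rows copied from the cleaned output above (five Adelie, five Gentoo)
sample = pd.DataFrame({
    "Species": ["Adelie"] * 5 + ["Gentoo"] * 5,
    "Flipper Length (mm)": [181.0, 186.0, 195.0, 193.0, 190.0,
                            214.0, 215.0, 222.0, 212.0, 213.0],
    "Culmen Length (mm)": [39.1, 39.5, 40.3, 36.7, 39.3,
                           47.2, 46.8, 50.4, 45.2, 49.9],
})

# Average flipper and culmen length for each species in the sample
means = sample.groupby("Species")[["Flipper Length (mm)", "Culmen Length (mm)"]].mean()
print(means.round(2))
```

In this sample the Gentoo rows average about 26 mm more flipper length and about 9 mm more culmen length than the Adelie rows; running the same groupby on the full penguins dataframe would bring Chinstrap into the comparison.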

This insight could be useful for training models to predict a penguin’s species from its phenotype, tracking the evolution of these species over time, or assessing other general trends in the species’ populations.
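As a taste of what such a model might look like, here is a deliberately tiny nearest-centroid sketch. It is not something this tutorial trains or evaluates: it uses only the ten Adelie and Gentoo rows visible in the cleaned output above, leaves the features unscaled, and predict_species is a hypothetical helper written just for this illustration.

```python
import numpy as np

# (flipper length, culmen length) pairs copied from the cleaned output above;
# the first five rows are Adelie and the last five are Gentoo
features = np.array([
    [181.0, 39.1], [186.0, 39.5], [195.0, 40.3], [193.0, 36.7], [190.0, 39.3],
    [214.0, 47.2], [215.0, 46.8], [222.0, 50.4], [212.0, 45.2], [213.0, 49.9],
])
labels = np.array(["Adelie"] * 5 + ["Gentoo"] * 5)

# One centroid (the mean point) per species
species_names = np.unique(labels)
centroids = np.array([features[labels == name].mean(axis=0) for name in species_names])


def predict_species(flipper_mm, culmen_mm):
    """Hypothetical helper: return the species whose centroid is closest."""
    point = np.array([flipper_mm, culmen_mm])
    distances = np.linalg.norm(centroids - point, axis=1)
    return str(species_names[np.argmin(distances)])


print(predict_species(185.0, 38.0))
print(predict_species(218.0, 48.0))
```

A real model would need all three species, a held-out test set, and probably more features, but the idea is the same: the separation we saw in the plot is what gives a classifier something to work with.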